Author: Kelli Davies

Introduction

The Ames, Iowa housing data set contains information from the Ames Assessor’s Office used to compute assessed values for residential properties from 2006 to 2010. There are 2930 observations and 82 variables.
Sources: Ames, Iowa Assessor’s Office
Data set details: https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt

In this project, I developed a linear regression model to predict housing prices from the Ames data set. An exploratory data analysis identified variables associated with price and described the effects of log transformations on variable distributions and correlations with price. Models were selected with AIC, BIC, and Bayesian model selection and evaluated with residual plots and RMSE values for the train and test sets. A simple model of only 6 variables produced strong price predictions, with an AdjR2 over 0.84. After analysis of anomalies in home prices from non-normal sales, model performance was improved by modeling normal sales only. A larger model with 17 explanatory variables had an AdjR2 over 0.92 with improved RMSE scores, and its prediction intervals exhibited slightly less than nominal coverage probability (0.947 vs. 0.950).

The data set files were obtained from Duke’s Statistics MOOC and the dataset had already been broken into a train set (1000 observations), a test set (816 observations), and a validation set (763 observations).


Contents

The analysis consists of an EDA followed by the selection and evaluation of a small model, and then a larger model, to predict housing prices.
Clicking on the headers throughout the document returns to the main table of contents here.

Part 1. EDA

1.1 The Ames dataset contains few missing values
1.2 The price distribution is more normally distributed with log transformation
1.3 Area metrics are more normally distributed with log transformation
1.4 Area and Price exhibit the best correlation when log transformed
1.5 Yr.Sold is associated with modest differences in price
1.6 Quality, but not Condition, is strongly associated with house price
1.7 Location is associated with price and quality differences
1.8 Correlation matrix identifies additional numerical variables associated with price

Part 2. Initial Model

2.1 Simple linear regression model with 6 explanatory variables
2.2 Model selection and feature analysis with BIC, AIC, and BMA produce similar models
2.3 Residual Plots of AIC Selected Model and Identification of Outliers
2.4 Relative Model Performance w/ RMSE: AIC selected model performs best
2.5 Analysis of Sale.Condition: Modeling normal sales only improves overall model

Part 3. Final Model

3.1 Feature Engineering and Linear Regression Model with 21 Explanatory Variables
3.2 BIC and AIC selection methods produce different models
3.3 Residual Plots of AIC Selected Model
3.4 Relative Model Performance w/ RMSE: AIC selected model again performs best
3.5 Based on coverage probability, the model correctly reflects uncertainty
3.6 Conclusions

Load Training Data and packages

Analysis of the training set.

load("ames_train.Rdata")
library(statsr)
library(MASS)
library(plyr)
library(dplyr)
library(BAS)
library(ggplot2)
library(scales)
library(gridExtra)
library(knitr)
library(kableExtra)
library(corrplot)
options(knitr.table.format = "html") 

Part 1 - Exploratory Data Analysis (EDA)

Before modeling, exploratory data analysis of the training set is used to better understand variable distributions and relationships among variables. The EDA identifies the best candidate variables to include in the model.

1.1 The Ames dataset contains few missing values

First, the missing amount of data in the train set is calculated to understand the completeness of the dataset and which variables may require additional processing or imputation before use in modeling.

n <- nrow(ames_train)
missing_data <- data.frame("Feature" = names(ames_train), "Frac_NA" = colSums(is.na(ames_train))/n)
missing_data <- missing_data[order(-missing_data$Frac_NA),]
kable(missing_data[missing_data$Frac_NA > 0,], row.names = FALSE) %>%
  kable_styling("striped", full_width = F)
Feature Frac_NA
Pool.QC 0.997
Misc.Feature 0.971
Alley 0.933
Fence 0.798
Fireplace.Qu 0.491
Lot.Frontage 0.167
Garage.Yr.Blt 0.048
Garage.Qual 0.047
Garage.Cond 0.047
Garage.Type 0.046
Garage.Finish 0.046
Bsmt.Qual 0.021
Bsmt.Cond 0.021
Bsmt.Exposure 0.021
BsmtFin.Type.1 0.021
BsmtFin.Type.2 0.021
Mas.Vnr.Area 0.007
BsmtFin.SF.1 0.001
BsmtFin.SF.2 0.001
Bsmt.Unf.SF 0.001
Total.Bsmt.SF 0.001
Bsmt.Full.Bath 0.001
Bsmt.Half.Bath 0.001
Garage.Cars 0.001
Garage.Area 0.001

Of the 82 variables in the dataset, the 25 above have NA values. The pool quality variable (Pool.QC) has the greatest percentage of NA values (99.7%). This makes sense because most houses in Iowa probably do not have pools, and so there is no applicable value to put in the field. Similarly, homes may not have an alley, fence, fireplace, garage or basement. For many variables, the documentation confirms that “NA” corresponds to not having the specified feature and so these are not actually missing data.

Further analysis of the missing values suggests that, for a given feature, “NA” can mean either not having the feature or having missing data for the feature. For example, for the Garage categorical variables (Garage Qual, Type, etc.), the documentation states that “NA” corresponds to “no garage”. Based on the 4.6% NA rate for Garage.Type, at most 4.6% of houses do not have garages. Related variables such as Garage Qual and Garage Yr Blt have slightly higher NA rates (4.7% and 4.8%) than the 4.6% for Garage.Type, indicating a small amount of true missing data. If variables with missing data are selected for the model, a subset of homes will not have values for these features, and those homes will either need to be omitted from the analysis or have the missing values imputed. For categorical variables, “NA” can be reassigned as a category indicating feature absence, and there is also the option to generate binary yes/no variables indicating whether the feature is present in the home.
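The recoding options just described can be sketched in a few lines. This is illustrative only; Has.Garage is a hypothetical new variable, not part of the original analysis:

```r
# Fold NA ("no garage") into an explicit "None" level and add a
# binary presence indicator (sketch; assumes ames_train is loaded).
garage <- as.character(ames_train$Garage.Type)
garage[is.na(garage)] <- "None"
ames_train$Garage.Type <- factor(garage)
ames_train$Has.Garage <- as.integer(ames_train$Garage.Type != "None")
```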

When considering that “NA” often means the feature is not present in the home, the dataset looks very complete.


1.2 The price distribution is more normally distributed with log transformation

Next, the distribution of the response variable, price, is assessed for normality using histograms and Q-Q plots. Variables such as price often exhibit a strong right skew, which can make modeling difficult. Log transformations can make the data more normally distributed and allow for a better linear fit between response and explanatory variables.

means <- ames_train %>% summarise(mean=mean(price), logmean = mean(log(price)))
medians <- ames_train %>% summarise(median=median(price), logmedians = median(log(price)))

pr_mean <- as.numeric(means[1,1])
pr_med <- as.numeric(medians[1,1])
p1 <- ggplot(data=ames_train, aes(x=price)) + geom_histogram(bins=30) + ggtitle("Price Distribution") + 
  labs(x="House price ($)") + geom_vline(xintercept=pr_mean, linetype="solid", color="red", size=.5)  +
  geom_vline(xintercept=pr_med, linetype="solid", color = "blue", size=.5) + 
  annotate("text", label="mean", x=pr_mean+25000, y = 140, color='red')  + 
  annotate("text", label="median", x=pr_med-30000, y = 135, color='blue')

pr_mean <- as.numeric(means[1,2])
pr_med <- as.numeric(medians[1,2])
p2 <- ggplot(data=ames_train, aes(x=log(price))) + geom_histogram(bins=30) + ggtitle("log(price) Distribution") +
  labs(x="log(price)") + geom_vline(xintercept=pr_mean, linetype="solid", color="red", size=.5)  +   
  geom_vline(xintercept=pr_med, linetype="solid", color="blue", size=.5) + 
  annotate("text", label="mean", x=pr_mean+.2, y=140, color='red')  + 
  annotate("text", label="median", x=pr_med-.2, y=135, color='blue')

p3 <- ggplot(ames_train, aes(sample=price)) + stat_qq() + 
  labs(title = "price Q-Q Plot", x="Theoretical Quantiles", y="Sample Quantiles")
p4 <- ggplot(ames_train, aes(sample=log(price))) + stat_qq() + 
  labs(title="Log(price) Q-Q Plot", x="Theoretical Quantiles", y="Sample Quantiles")

summary <- ames_train %>% 
  summarise(price_mean=mean(price), price_median=median(price), sd=sd(price), logprice_mean= mean(log(price)), logprice_median=median(log(price)), logprice_sd=sd(log(price)))

grid.arrange(p1, p2, p3, p4, ncol=2)

kable(summary) %>%
  kable_styling("striped", full_width = F)
price_mean price_median sd logprice_mean logprice_median logprice_sd
181190.1 159467 81909.79 12.01847 11.97959 0.4206

The response variable, price, is not normally distributed. There is a right skew, with the mean ($181,190) greater than the median ($159,467). The skew is not surprising for price data: there is no upper bound on house price, and a subset of the population has much higher income and can therefore afford more expensive houses. A log transformation makes the price variable more normal, as evidenced by the more symmetrical histogram and more linear Q-Q plot (right panels above). Using the log-transformed price has the potential to provide a better linear fit and a more representative model of housing prices.


1.3 Area metrics are more normally distributed with log transformation

There are several other variables relating to Area that may also have a strong right skew. Two of these variables, area (the square footage of the home) and Lot.Area (square footage of the lot) will be analyzed with and without log transformation:

p5 <- ggplot(ames_train, aes(sample=area)) + stat_qq() + 
  labs(title="area Q-Q Plot", x="Theoretical Quantiles", y="Sample Quantiles")
p6 <- ggplot(ames_train, aes(sample=log(area))) + stat_qq() + 
  labs(title="log(area) Q-Q Plot", x="Theoretical Quantiles", y="Sample Quantiles")
p7 <- ggplot(ames_train, aes(sample=Lot.Area)) + stat_qq() + 
  labs(title="Lot.Area Q-Q Plot", x="Theoretical Quantiles", y="Sample Quantiles")
p8 <- ggplot(ames_train, aes(sample=log(Lot.Area))) + stat_qq() + 
  labs(title="log(Lot.Area) Q-Q Plot", x="Theoretical Quantiles", y="Sample Quantiles")

grid.arrange(p5, p6, p7, p8, ncol=2)

Log transformation results in more normal distributions for area and Lot.Area, as indicated by the more linear Q-Q plots post-transformation. Lot.Area has a few large-valued outliers, and even with the transformation its Q-Q plot is not fully linear. The relationship between price and area, with and without log transformation, will now be investigated:


1.4 Area and Price exhibit the best correlation when log transformed

ames_train["LogPrice"] <- log(ames_train["price"])
ames_train["LogArea"] <- log(ames_train["area"])
corvals <- round(cor(ames_train[, c("price", "LogPrice", "area", "LogArea")]),2)

plot1 <- ggplot(data=ames_train, aes(x=area, y=price)) + geom_jitter() +   
  ggtitle('price v. area') + annotate("text", label=paste("r=", corvals["price", "area"]), x=4000, y=50000)

plot2 <- ggplot(data=ames_train, aes(x=log(area), y=price)) + geom_jitter() +   
  ggtitle('price v. log(area)')+ annotate("text", label=paste("r=", corvals["price", "LogArea"]), x=8, y=50000)

plot3 <- ggplot(data=ames_train, aes(x=area, y=log(price))) + geom_jitter() +   
  ggtitle('log(price) v. area')+ annotate("text", label=paste("r=", corvals["LogPrice", "area"]), x=4000, y=10)

plot4 <- ggplot(data=ames_train, aes(x=log(area), y=log(price))) + geom_jitter() +   
  ggtitle('log(price) v. log(area)')+ annotate("text", label=paste("r=", corvals["LogPrice", "LogArea"]), x=8, y=10)

grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

The graphs above illustrate that area and price are linearly associated and that log transformations of these variables produce a better linear fit. With log transformations of both price and area, the data look more linear and the Pearson’s r correlation coefficient is slightly stronger.


1.5 Yr.Sold is associated with modest differences in price

The Ames dataset contains sales from 2006 to 2010. Home prices often change over time due to market fluctuations. We will check whether this holds for the sale prices of homes in the dataset:

ggplot(ames_train, aes(x=as.factor(Yr.Sold), y=price)) + geom_boxplot() + 
  geom_jitter(color = "blue", alpha = 0.1, width = 0.2) + 
  labs(title='Price vs Yr.Sold', x= "Year Sold", y="price ($)") + scale_y_continuous(labels=comma)

summary <- ames_train %>% group_by(Yr.Sold) %>% 
  summarise(mean=mean(price), median=median(price), sd=sd(price), log.median=median(log(price)))

kable(summary) %>%
  kable_styling("striped", full_width = F)
Yr.Sold mean median sd log.median
2006 182442.2 157500 80572.54 11.96718
2007 185026.9 165500 83662.81 12.01672
2008 182685.2 163000 84025.99 12.00151
2009 176514.2 159717 72610.69 11.98116
2010 176712.1 152000 94060.15 11.93164

Overall, the prices look relatively consistent across the years. The mean and median values show the same trends: there is a slight increase of about 5% in house prices from 2006 to 2007 (median values of $157,500 to $165,500), after which prices consistently decline 1.5% to 4.8% per year for the next three years. As a result, the median house price for the final year (2010) is 3.5% lower than for the initial year (2006). Additionally, based on the density of the raw data observations (blue circles), there appear to be fewer house sales in the final year, 2010. The decline in price beginning in 2008 is expected, because house prices across the U.S. began declining at that time. While Yr.Sold does not explain the bulk of the variability in house prices, it is associated with modest differences.

* * *

1.6 Quality, but not Condition, is strongly associated with house price

Next, the association of the quality and condition scores with price and log(price) will be analyzed:

corvals <- round(cor(ames_train[, c("price", "LogPrice", "Overall.Cond", "Overall.Qual")]),2)

plot1 <- ggplot(ames_train, aes(x=as.factor(Overall.Cond), y=price)) + geom_boxplot() + 
  labs(title='Price vs Overall.Cond', x="Overall Condition", y="Price") + 
  annotate("text", label=paste("r=", corvals['price', 'Overall.Cond']), x=8.5, y=50000)

plot2 <- ggplot(ames_train, aes(x=as.factor(Overall.Cond), y=log(price))) + geom_boxplot()  + 
  labs(title='log(Price) vs Overall.Cond', x="Overall Condition", y="log(Price)") + 
  annotate("text", label=paste("r=", corvals['LogPrice', 'Overall.Cond']), x=8.5, y=9.8)

plot3 <-ggplot(ames_train, aes(x=as.factor(Overall.Qual), y=price)) + geom_boxplot() + 
  labs(title='Price vs Overall.Qual', x="Overall Quality", y="Price") + 
  annotate("text", label=paste("r=", corvals['price', 'Overall.Qual']), x=9.5, y=50000)

plot4 <- ggplot(ames_train, aes(x=as.factor(Overall.Qual), y=log(price))) + geom_boxplot()  + labs(title='log(Price) vs Overall.Qual', x="Overall Quality", y="log(Price)") + annotate("text", label=paste("r=", corvals['LogPrice', 'Overall.Qual']), x=9.5, y=9.8)

grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

The relationships of the overall Quality and Condition metrics with price (left column) and log(price) (right column) are displayed above. The Overall Quality variable (bottom plots) is by far a better indicator of home price than Overall Condition. One observes a clear increase in the median home value with increasing Overall.Qual, which appears linear when plotted against log(price). In contrast, Overall.Cond does not show a clear increase in medians with increasing scores, and it has broader, more overlapping distributions. Overall.Cond values from 6 to 10 exhibit similar price distributions, indicating that differences among the higher Overall.Cond values are not informative for predicting price.

The Pearson correlation coefficient values are shown to report the strength of the linear relationship. For Overall.Cond, the linear relationship, if any, is very weak and surprisingly slightly negative, with Pearson correlation coefficients of -0.14 and -0.07. These indicate that Overall.Cond is not a good variable for price prediction. For Overall.Qual, the Pearson correlation with price and log(price) is strong (0.80 and 0.83).

Overall, the data look better when plotted against log(price) rather than untransformed price. Boxplots cannot fully assess normality, but one can observe that log(price) tends to produce fewer outliers and more centered medians. For Overall.Cond (top graphs), the value 2 has a median centered in the IQR after log transformation, and the value 5 has fewer outliers, distributed more symmetrically around the IQR. For Overall.Qual, using log(price) rather than price results in a more linear relationship.

There are several factor variables that also pertain to Condition and Quality, but are reported with descriptive ordinal rankings:

   Po   Poor
   Fa   Fair
   TA   Typical/Average
   Gd   Good
   Ex   Excellent
   

These have associated numeric level values; however, the levels are ordered alphabetically (e.g., 1 = Excellent and 5 = Typical/Average). The numeric factor levels will be reassigned according to their ordinal rankings, from 1 (Poor) to 5 (Excellent), so they can be plotted in this order to check for a clear linear relationship with log(price), as was observed for Overall Quality above.

modify_levels <- function(df){
  df$Kitchen.Qual <- as.numeric(factor(df$Kitchen.Qual, levels=c("Po", "Fa", "TA", "Gd", "Ex")))
  df$Exter.Qual <- as.numeric(factor(df$Exter.Qual, levels=c("Po", "Fa", "TA", "Gd", "Ex")))
  df$Exter.Cond <- as.numeric(factor(df$Exter.Cond, levels=c("Po", "Fa", "TA", "Gd", "Ex")))
  df$Heating.QC <- as.numeric(factor(df$Heating.QC, levels=c("Po", "Fa", "TA", "Gd", "Ex")))
  df
}

ames_train <- modify_levels(ames_train)

cols <- c("LogPrice", "Overall.Qual", "Overall.Cond", "Kitchen.Qual", "Exter.Qual", "Exter.Cond", "Heating.QC")
corvals <- round(cor(ames_train[, cols]),2)

plot1 <- ggplot(ames_train, aes(x=as.factor(Exter.Qual), y=log(price))) + geom_boxplot() + 
  labs(title='Log(price) vs Exter.Qual', x="Exterior Quality", y="log(price)") + 
  annotate("text", label=paste("r=", corvals[1,"Exter.Qual"]), x=4, y=10)

plot2 <- ggplot(ames_train, aes(x=as.factor(Exter.Cond), y=log(price))) + geom_boxplot() + 
  labs(title='Log(price) vs Exter.Cond', x= "Exterior Condition", y="log(price)") + 
  annotate("text", label=paste("r=", corvals[1,"Exter.Cond"]), x=4, y=10)

plot3 <- ggplot(ames_train, aes(x=as.factor(Kitchen.Qual), y=log(price))) + geom_boxplot() + 
  labs(title='Log(price) vs Kitchen.Qual', x= "Kitchen Quality", y="log(price)") + 
  annotate("text", label=paste("r=", corvals[1,"Kitchen.Qual"]), x=5, y=10)

plot4 <- ggplot(ames_train, aes(x=as.factor(Heating.QC), y=log(price))) + geom_boxplot() + 
  labs(title='Log(price) vs Heating.QC', x= "Heating.QC", y="log(price)") + 
  annotate("text", label=paste("r=", corvals[1,"Heating.QC"]), x=5, y=10)

grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

Based on the Exterior variables (top graphs), Quality is again a much better indicator of price than Condition, with a correlation coefficient of r = 0.69 for log(price) vs. Exter.Qual versus r = -0.03 for log(price) vs. Exter.Cond. Kitchen.Qual and Heating.QC also display linear relationships with log(price), with r values of 0.68 and 0.52, respectively. Overall.Qual, Exter.Qual, Kitchen.Qual, and Heating.QC will be used in an initial simple model.

*Note: In the above analysis, Pearson’s r was used to quantify the strength of the linear relationship between each numeric explanatory variable and log(price). Kendall’s tau is often used to measure association for ordinal variables (which the condition and quality metrics are). Alternative analysis with Kendall’s tau tended to produce similar, but weaker, associations:

corvals_k <- round(cor(ames_train[, cols], method='kendall'),2)
kable(data.frame("Pearson.R.with.LogPrice"=corvals[1,], "Kendall.Tau.with.LogPrice"=corvals_k[1,])) %>%
  kable_styling("striped", full_width = F)
Pearson.R.with.LogPrice Kendall.Tau.with.LogPrice
LogPrice 1.00 1.00
Overall.Qual 0.83 0.68
Overall.Cond -0.07 -0.16
Kitchen.Qual 0.68 0.57
Exter.Qual 0.69 0.57
Exter.Cond -0.03 -0.05
Heating.QC 0.52 0.43

1.7 Location is associated with price and quality differences

Another important variable to consider is Neighborhood, as location is known to strongly influence price. A boxplot of log(price) by neighborhood is used to determine whether location is associated with the price differences observed in the ames_train dataset:

neighborhood_order <- ames_train %>% group_by(Neighborhood) %>% 
  summarize(med_price = median(price)) %>% 
  arrange(med_price) %>% pull(Neighborhood)

ames_train$Neighborhood_ordered <- factor(ames_train$Neighborhood, levels = neighborhood_order)

summary <- ames_train %>% 
  filter(Neighborhood %in% c('StoneBr', 'NridgHt', "MeadowV", "BrDale", "IDOTRR")) %>%  
  group_by(Neighborhood) %>% 
  summarize(mean=mean(price), median=median(price), min=min(price), max=max(price), sd=sd(price), 
            cv= 100*(sd(price)/mean(price)))

ggplot(ames_train, aes(x=Neighborhood_ordered, y=log(price))) + geom_boxplot() +  
  geom_jitter(color="blue", alpha=0.1, width = 0.2) + scale_y_continuous(labels=comma) + 
  ggtitle("Log(price) by Neighborhood") + theme(axis.text.x=element_text(angle=45, hjust=1))

kable(summary) %>%
  kable_styling("striped", full_width = F)
Neighborhood mean median min max sd cv
BrDale 98930.00 98750.0 83000 125500 13337.59 13.48184
IDOTRR 97620.69 99500.0 34900 150909 31530.44 32.29894
MeadowV 92946.88 85750.0 73000 129500 18939.78 20.37699
NridgHt 333646.74 336860.0 173000 615000 105088.90 31.49706
StoneBr 339316.05 340691.5 180000 591587 123459.10 36.38469

From the boxplot, there is considerable variation in the median prices and ranges across neighborhoods, and many distributions do not overlap, indicating that Neighborhood may be a strong predictor of price. The neighborhoods are ordered from lowest to highest median price.

Because the mean is sensitive to skew and outliers, the median price provides the most robust measure of relative house prices. Using medians, just two neighborhoods have median prices over $300,000. These two most expensive neighborhoods are StoneBr (most expensive, median = $340,692), followed by NridgHt (median = $336,860).

There are three neighborhoods with median prices under $100,000. These least expensive neighborhoods are MeadowV (least expensive median of $85,750), followed by BrDale (median of $98,750) and then IDOTRR (median of $99,500).

The most expensive neighborhoods have the highest standard deviations; however, after adjusting for the actual housing prices (coefficient of variation), it is apparent that some of the lower-priced neighborhoods (e.g., IDOTRR) exhibit comparable variation.

In the previous section, Overall.Quality was identified as a strong predictor of log(price). It is plausible that the Overall.Qual is associated with the Neighborhood. The Overall.Qual vs Neighborhood is plotted to investigate their relationship:

ggplot(ames_train, aes(x=Neighborhood_ordered, y=Overall.Qual)) + geom_boxplot() + 
  geom_jitter(color="blue", alpha=0.1, width=0.2) + scale_y_continuous(labels=comma) + 
  ggtitle("Overall.Qual by Neighborhood") + theme(axis.text.x = element_text(angle=45, hjust=1))

The neighborhoods are listed in the same order as in the previous graph (ascending median price). In general, the overall quality of a neighborhood increases as its median price increases (from left to right in the ordered neighborhoods), although the increase is not monotonic. There appears to be an association between the Neighborhood and Overall.Qual variables, but not to the extent that they are likely to be collinear/redundant in predicting price.


1.8 Correlation matrix identifies additional numerical variables associated with price

There are still many variables in the dataset that may be informative for price prediction. A correlation matrix will be used to get a quick overview of which numeric variables may be correlated with price. Correlation matrices are also useful for identifying collinear variables, so numeric variables already explored above are also included in the matrix.

ames_train["Total.Baths"] <- ames_train["Full.Bath"] + ames_train["Half.Bath"]/2
ames_train["LogLot.Area"] <- log(ames_train["Lot.Area"])
ames_train["LogX1st.Flr.SF"] <- log(ames_train["X1st.Flr.SF"])

corr_cols <-c("LogPrice","LogArea", "Lot.Area", "LogLot.Area","Full.Bath", "Half.Bath", "Total.Baths", "TotRms.AbvGrd", "Bedroom.AbvGr", "Year.Built", "Exter.Qual", "Kitchen.Qual", "Overall.Qual", "Pool.Area", "Year.Remod.Add", "Heating.QC",  "Fireplaces", "LogX1st.Flr.SF", "X2nd.Flr.SF", "Total.Bsmt.SF", 'Garage.Area')

ames_train_num <- ames_train[, corr_cols]
cormat <- round(cor(na.omit(ames_train_num)),1)
corrplot(cormat, method = 'number', number.cex = .9)

If variables are collinear, they contribute redundant information to the model, making coefficients difficult to interpret. A correlation of ~0.9 is typically considered indicative of collinearity. The only variables that appear collinear here are Full.Bath and Total.Baths (a new variable generated by summing Full.Bath and half of Half.Bath). Although in the rounded matrix above these variables correlate identically with price (0.6), the correlation for Total.Baths is actually slightly higher (0.61 vs. 0.57), so Total.Baths will be used in the model rather than Full.Bath.
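A quick way to surface such pairs programmatically is to scan the upper triangle of the correlation matrix for values above the threshold (a sketch, reusing the `cormat` object computed in the code block above):

```r
# List variable pairs with |r| >= 0.9 from the correlation matrix.
# upper.tri() restricts to one triangle, avoiding duplicates and
# the diagonal of 1s.
high <- which(abs(cormat) >= 0.9 & upper.tri(cormat), arr.ind = TRUE)
data.frame(var1 = rownames(cormat)[high[, 1]],
           var2 = colnames(cormat)[high[, 2]],
           r    = cormat[high])
```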

The matrix reveals that other variables, such as Total.Bsmt.SF and Garage.Area, also correlate with log(price). This indicates that the basement and garage are important for the selling price, so these or related variables will be considered for the final model.

Many variables above were log transformed because the transformation improved their normality and correlation with price. Both Lot.Area and log(Lot.Area) are included as an example above (see also section 1.3 Area metrics are more normally distributed with log transformation). There are several variables, however, with many zero values (such as X2nd.Flr.SF (second floor square footage), Total.Bsmt.SF, and Garage.Area) that did not benefit from log transformation. For these, log transformation was attempted by first converting the zeros to one (as log of zero is undefined and log of 1 is zero). Analysis with side-by-side Q-Q plots and correlation values (not shown) showed that, for these variables, distribution normality and correlation with log(price) were not improved by log transformation, so they are left untransformed.
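The zero-to-one conversion described above can be expressed as a small helper (a sketch; `safe_log` is an illustrative name, not from the original code):

```r
# log(0) is undefined, so map zeros to one before taking logs;
# log(1) = 0 keeps "feature absent" at zero on the log scale.
safe_log <- function(x) log(ifelse(x == 0, 1, x))

# e.g., for the zero-heavy square-footage variables named above:
# ames_train$LogX2nd.Flr.SF <- safe_log(ames_train$X2nd.Flr.SF)
```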
* * *

Part 2 - Initial Model

2.1 Simple linear regression model with 6 explanatory variables

Next, a simple model for house sale prices is generated using predictors identified in the EDA. The small model helps establish a baseline for how much predictive power is possible from a relatively small number of predictors.

log(price) ~ log(area) + Overall.Qual + Exter.Qual + Kitchen.Qual + Heating.QC + Neighborhood

A total of 6 explanatory variables without missing values are used in the initial model. The response variable, price, is log transformed to generate a more normal distribution, which produced stronger linear correlations in the EDA. The variable area is selected because it had a strong Pearson’s r correlation (greater than 0.7) with log(price). The area distribution was skewed, so it too was log transformed, resulting in more normal data with a stronger correlation. In the EDA, the quality metrics Overall.Qual, Exter.Qual, Kitchen.Qual, and Heating.QC all showed linear trends, with Pearson’s r values of 0.52 to 0.83. Lastly, the neighborhoods in the dataset had very different price distributions in terms of median values and standard deviations; Neighborhood was included because location appeared strongly associated with price.

simple_model <- lm(log(price) ~ log(area) + Overall.Qual + Exter.Qual + Kitchen.Qual + Heating.QC + Neighborhood, data=ames_train)
summary(simple_model)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Exter.Qual + 
##     Kitchen.Qual + Heating.QC + Neighborhood, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65076 -0.07956  0.00503  0.09327  0.57957 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.886712   0.157109  50.199  < 2e-16 ***
## log(area)            0.414408   0.022134  18.722  < 2e-16 ***
## Overall.Qual         0.096876   0.007244  13.373  < 2e-16 ***
## Exter.Qual           0.033636   0.016620   2.024 0.043271 *  
## Kitchen.Qual         0.076372   0.012937   5.903 4.93e-09 ***
## Heating.QC           0.033528   0.007268   4.613 4.50e-06 ***
## NeighborhoodBlueste -0.096547   0.110221  -0.876 0.381280    
## NeighborhoodBrDale  -0.278414   0.075051  -3.710 0.000219 ***
## NeighborhoodBrkSide -0.098466   0.058902  -1.672 0.094908 .  
## NeighborhoodClearCr  0.159107   0.070315   2.263 0.023870 *  
## NeighborhoodCollgCr  0.053101   0.053856   0.986 0.324390    
## NeighborhoodCrawfor  0.105885   0.060574   1.748 0.080776 .  
## NeighborhoodEdwards -0.053676   0.056945  -0.943 0.346128    
## NeighborhoodGilbert  0.028072   0.056531   0.497 0.619596    
## NeighborhoodGreens   0.090537   0.098970   0.915 0.360533    
## NeighborhoodGrnHill  0.370108   0.128707   2.876 0.004121 ** 
## NeighborhoodIDOTRR  -0.268169   0.060595  -4.426 1.07e-05 ***
## NeighborhoodMeadowV -0.213664   0.068182  -3.134 0.001778 ** 
## NeighborhoodMitchel  0.119277   0.058112   2.053 0.040386 *  
## NeighborhoodNAmes    0.045283   0.054558   0.830 0.406747    
## NeighborhoodNoRidge  0.136401   0.060319   2.261 0.023960 *  
## NeighborhoodNPkVill -0.061861   0.099131  -0.624 0.532754    
## NeighborhoodNridgHt  0.184372   0.055774   3.306 0.000982 ***
## NeighborhoodNWAmes   0.052050   0.058971   0.883 0.377650    
## NeighborhoodOldTown -0.190379   0.056754  -3.354 0.000826 ***
## NeighborhoodSawyer   0.066618   0.057327   1.162 0.245494    
## NeighborhoodSawyerW -0.031264   0.056822  -0.550 0.582307    
## NeighborhoodSomerst  0.064209   0.054149   1.186 0.235995    
## NeighborhoodStoneBr  0.208082   0.063444   3.280 0.001076 ** 
## NeighborhoodSWISU   -0.127625   0.071962  -1.774 0.076458 .  
## NeighborhoodTimber   0.155652   0.063571   2.448 0.014523 *  
## NeighborhoodVeenker  0.096415   0.073487   1.312 0.189827    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1674 on 968 degrees of freedom
## Multiple R-squared:  0.8466, Adjusted R-squared:  0.8417 
## F-statistic: 172.3 on 31 and 968 DF,  p-value: < 2.2e-16

In the initial simple model generated above, the adjusted R-squared is 0.8417. Based on p-values, all non-Neighborhood variables are statistically significant: log(area), Overall.Qual, Kitchen.Qual, and Heating.QC are highly significant (p<0.001), while Exter.Qual barely makes the p<0.05 cutoff. Many Neighborhood levels are not significant, but 7 of the 26 Neighborhood coefficients shown are significant at the p<0.01 or p<0.001 level (two- and three-star values). The neighborhood Blmngtn does not appear as a coefficient above and is therefore the reference level for the Neighborhood variable. Its median price is near the median for the whole dataset; consequently, the seven most significant neighborhoods are the three with the highest median prices (StoneBr, NridgHt, GrnHill), which have positive coefficients, and the four with the lowest median prices (OldTown, MeadowV, BrDale, and IDOTRR), which have negative coefficients. While the Neighborhood levels have both positive and negative coefficients, all non-Neighborhood coefficients are positive, meaning an increase in their value is associated with an increase in price. For example, each unit increase in Overall.Qual is associated with an increase of 0.097 in log(price), on average, holding the other variables constant.
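Because the response is log(price), a coefficient can also be read as an approximate percentage change in price. For example, using the Overall.Qual estimate from the summary above:

```r
# Convert a log-scale coefficient to an exact percentage change in price
b <- 0.096876              # Overall.Qual coefficient from summary(simple_model)
(exp(b) - 1) * 100         # ~10.2% higher price per one-unit quality increase
```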


2.2 Model selection and feature analysis with BIC, AIC, and BMA produce similar models

Model selection can help ensure a parsimonious model with only informative explanatory variables. Backwards selection methods based on AIC and BIC will be used, as well as the Bayesian Model Averaging (BMA) method (selecting the Highest Probability Model).

AIC and BIC are penalized-likelihood criteria that balance a model’s goodness of fit against its complexity. The penalty for additional parameters is stronger with BIC. AIC is more susceptible to overfitting the data by including more predictors, while BIC may underfit the data by excluding predictors.
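The two penalties can be made concrete by computing both criteria by hand from a fitted lm. This illustrative sketch uses R’s built-in cars data, not the Ames data:

```r
# Illustrative only: AIC and BIC computed by hand from a fitted lm
fit <- lm(dist ~ speed, data = cars)
ll  <- as.numeric(logLik(fit))
k   <- attr(logLik(fit), "df")   # parameter count, including the error variance
n   <- nrow(cars)
-2 * ll + 2 * k                  # AIC: penalty of 2 per parameter; matches AIC(fit)
-2 * ll + log(n) * k             # BIC: penalty of log(n) per parameter; matches BIC(fit)
```

Since log(n) > 2 whenever n > 7, BIC penalizes each extra parameter more heavily than AIC for any realistically sized dataset.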

The Bayesian Model Averaging (BMA) method accounts for model uncertainty and generates posterior probabilities for all possible models. After fitting, the BMA function allows for prediction using all models, the highest probability model, the median probability model, or the single model whose predictions are closest to those of the full model average. Although the Highest Probability Model (HPM) may not give the best predictions, the HPM is used for prediction in this analysis because the other options resulted in very long computation times (likely owing to the unexpectedly large number of models generated by BMA; see the BMA analysis below).

AIC retained all six variables in the model, while BIC and Bayesian Model Averaging judged Exter.Qual to be less important: it was present neither in the BIC-selected model nor in the BMA Highest Probability Model, although its marginal inclusion probability was 0.35 and it appeared in 2 of the 5 most probable BMA models. All other variables were included in the BIC-selected model. The BMA selection was more complicated to interpret because the Neighborhood factor was split into individual binary indicators (one for each Neighborhood). Certain neighborhoods were considered very important while others were uninformative: 10 neighborhoods were in the Highest Probability Model, and 6 had marginal inclusion probabilities greater than 0.99. These results are consistent with certain Neighborhoods having more significant p-values in the final BIC-selected model. Since I am using Neighborhood as a single variable with multiple categories, the BIC and BMA model selection methods for practical purposes select the same prediction variables. Exter.Qual appears to be of borderline importance in the model.

The AIC Selection is shown below:

sample_size <- dim((ames_train))[1]
simple_model_AIC <- stepAIC(simple_model, k = 2)
## Start:  AIC=-3543.73
## log(price) ~ log(area) + Overall.Qual + Exter.Qual + Kitchen.Qual + 
##     Heating.QC + Neighborhood
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      27.113 -3543.7
## - Exter.Qual    1    0.1147 27.228 -3541.5
## - Heating.QC    1    0.5961 27.709 -3524.0
## - Kitchen.Qual  1    0.9761 28.089 -3510.4
## - Overall.Qual  1    5.0091 32.122 -3376.2
## - Neighborhood 26   10.2376 37.351 -3275.4
## - log(area)     1    9.8181 36.931 -3236.7
summary(simple_model_AIC)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Exter.Qual + 
##     Kitchen.Qual + Heating.QC + Neighborhood, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65076 -0.07956  0.00503  0.09327  0.57957 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.886712   0.157109  50.199  < 2e-16 ***
## log(area)            0.414408   0.022134  18.722  < 2e-16 ***
## Overall.Qual         0.096876   0.007244  13.373  < 2e-16 ***
## Exter.Qual           0.033636   0.016620   2.024 0.043271 *  
## Kitchen.Qual         0.076372   0.012937   5.903 4.93e-09 ***
## Heating.QC           0.033528   0.007268   4.613 4.50e-06 ***
## NeighborhoodBlueste -0.096547   0.110221  -0.876 0.381280    
## NeighborhoodBrDale  -0.278414   0.075051  -3.710 0.000219 ***
## NeighborhoodBrkSide -0.098466   0.058902  -1.672 0.094908 .  
## NeighborhoodClearCr  0.159107   0.070315   2.263 0.023870 *  
## NeighborhoodCollgCr  0.053101   0.053856   0.986 0.324390    
## NeighborhoodCrawfor  0.105885   0.060574   1.748 0.080776 .  
## NeighborhoodEdwards -0.053676   0.056945  -0.943 0.346128    
## NeighborhoodGilbert  0.028072   0.056531   0.497 0.619596    
## NeighborhoodGreens   0.090537   0.098970   0.915 0.360533    
## NeighborhoodGrnHill  0.370108   0.128707   2.876 0.004121 ** 
## NeighborhoodIDOTRR  -0.268169   0.060595  -4.426 1.07e-05 ***
## NeighborhoodMeadowV -0.213664   0.068182  -3.134 0.001778 ** 
## NeighborhoodMitchel  0.119277   0.058112   2.053 0.040386 *  
## NeighborhoodNAmes    0.045283   0.054558   0.830 0.406747    
## NeighborhoodNoRidge  0.136401   0.060319   2.261 0.023960 *  
## NeighborhoodNPkVill -0.061861   0.099131  -0.624 0.532754    
## NeighborhoodNridgHt  0.184372   0.055774   3.306 0.000982 ***
## NeighborhoodNWAmes   0.052050   0.058971   0.883 0.377650    
## NeighborhoodOldTown -0.190379   0.056754  -3.354 0.000826 ***
## NeighborhoodSawyer   0.066618   0.057327   1.162 0.245494    
## NeighborhoodSawyerW -0.031264   0.056822  -0.550 0.582307    
## NeighborhoodSomerst  0.064209   0.054149   1.186 0.235995    
## NeighborhoodStoneBr  0.208082   0.063444   3.280 0.001076 ** 
## NeighborhoodSWISU   -0.127625   0.071962  -1.774 0.076458 .  
## NeighborhoodTimber   0.155652   0.063571   2.448 0.014523 *  
## NeighborhoodVeenker  0.096415   0.073487   1.312 0.189827    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1674 on 968 degrees of freedom
## Multiple R-squared:  0.8466, Adjusted R-squared:  0.8417 
## F-statistic: 172.3 on 31 and 968 DF,  p-value: < 2.2e-16

All variables were retained. Next, the BIC Selection:

sample_size <- dim((ames_train))[1]
simple_model_BIC <- stepAIC(simple_model, k = log(sample_size))
## Start:  AIC=-3386.68
## log(price) ~ log(area) + Overall.Qual + Exter.Qual + Kitchen.Qual + 
##     Heating.QC + Neighborhood
## 
##                Df Sum of Sq    RSS     AIC
## - Exter.Qual    1    0.1147 27.228 -3389.4
## <none>                      27.113 -3386.7
## - Heating.QC    1    0.5961 27.709 -3371.8
## - Kitchen.Qual  1    0.9761 28.089 -3358.2
## - Neighborhood 26   10.2376 37.351 -3246.0
## - Overall.Qual  1    5.0091 32.122 -3224.1
## - log(area)     1    9.8181 36.931 -3084.6
## 
## Step:  AIC=-3389.37
## log(price) ~ log(area) + Overall.Qual + Kitchen.Qual + Heating.QC + 
##     Neighborhood
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      27.228 -3389.4
## - Heating.QC    1    0.6514 27.879 -3372.6
## - Kitchen.Qual  1    1.3779 28.606 -3346.9
## - Neighborhood 26   10.6497 37.878 -3238.9
## - Overall.Qual  1    5.9170 33.145 -3199.6
## - log(area)     1    9.7368 36.965 -3090.6
summary(simple_model_BIC)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Kitchen.Qual + 
##     Heating.QC + Neighborhood, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63654 -0.08198  0.00589  0.09348  0.56659 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.965155   0.152496  52.232  < 2e-16 ***
## log(area)            0.412176   0.022142  18.615  < 2e-16 ***
## Overall.Qual         0.101013   0.006961  14.511  < 2e-16 ***
## Kitchen.Qual         0.085301   0.012181   7.003 4.69e-12 ***
## Heating.QC           0.034896   0.007248   4.815 1.71e-06 ***
## NeighborhoodBlueste -0.117601   0.109904  -1.070 0.284871    
## NeighborhoodBrDale  -0.296029   0.074664  -3.965 7.88e-05 ***
## NeighborhoodBrkSide -0.113834   0.058503  -1.946 0.051971 .  
## NeighborhoodClearCr  0.145943   0.070125   2.081 0.037679 *  
## NeighborhoodCollgCr  0.048785   0.053899   0.905 0.365632    
## NeighborhoodCrawfor  0.094079   0.060388   1.558 0.119583    
## NeighborhoodEdwards -0.065915   0.056713  -1.162 0.245422    
## NeighborhoodGilbert  0.017654   0.056386   0.313 0.754280    
## NeighborhoodGreens   0.091272   0.099128   0.921 0.357408    
## NeighborhoodGrnHill  0.371538   0.128910   2.882 0.004037 ** 
## NeighborhoodIDOTRR  -0.283372   0.060224  -4.705 2.90e-06 ***
## NeighborhoodMeadowV -0.226251   0.068006  -3.327 0.000911 ***
## NeighborhoodMitchel  0.103059   0.057649   1.788 0.074135 .  
## NeighborhoodNAmes    0.030711   0.054167   0.567 0.570869    
## NeighborhoodNoRidge  0.134391   0.060407   2.225 0.026328 *  
## NeighborhoodNPkVill -0.082814   0.098746  -0.839 0.401869    
## NeighborhoodNridgHt  0.188669   0.055823   3.380 0.000754 ***
## NeighborhoodNWAmes   0.035537   0.058497   0.608 0.543655    
## NeighborhoodOldTown -0.206296   0.056297  -3.664 0.000261 ***
## NeighborhoodSawyer   0.050148   0.056837   0.882 0.377831    
## NeighborhoodSawyerW -0.036655   0.056851  -0.645 0.519230    
## NeighborhoodSomerst  0.061871   0.054223   1.141 0.254129    
## NeighborhoodStoneBr  0.210631   0.063533   3.315 0.000949 ***
## NeighborhoodSWISU   -0.143103   0.071668  -1.997 0.046133 *  
## NeighborhoodTimber   0.152795   0.063657   2.400 0.016570 *  
## NeighborhoodVeenker  0.089277   0.073519   1.214 0.224916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1676 on 969 degrees of freedom
## Multiple R-squared:  0.8459, Adjusted R-squared:  0.8412 
## F-statistic: 177.3 on 30 and 969 DF,  p-value: < 2.2e-16

As shown in the BIC-selected simple model above, only Exter.Qual was removed. The final AdjR2 value is 0.8412, slightly lower than the 0.8417 for the AIC-selected model. Below is the BMA analysis for the simple model:

simple_model_bma = bas.lm(log(price) ~ log(area) + Overall.Qual + Exter.Qual + Kitchen.Qual + Heating.QC + Neighborhood, data=ames_train, prior = "BIC", modelprior = uniform())

kable(summary(simple_model_bma)) %>%
  kable_styling("striped", full_width = F)
P(B != 0 | Y) model 1 model 2 model 3 model 4 model 5
Intercept 1.0000000 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
log(area) 1.0000000 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
Overall.Qual 1.0000000 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
Exter.Qual 0.3499009 0.0000 0.0000000 1.0000000 0.0000000 1.0000000
Kitchen.Qual 0.9999988 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
Heating.QC 0.9999606 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodBlueste 0.1171534 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodBrDale 0.9999999 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodBrkSide 0.9999970 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodClearCr 0.1966638 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodCollgCr 0.0399828 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodCrawfor 0.0689455 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodEdwards 0.9997429 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodGilbert 0.1306534 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodGreens 0.0329364 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodGrnHill 0.4969697 0.0000 1.0000000 0.0000000 1.0000000 1.0000000
NeighborhoodIDOTRR 1.0000000 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodMeadowV 0.9999998 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodMitchel 0.2622759 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodNAmes 0.1035758 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodNoRidge 0.2734670 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodNPkVill 0.0966903 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodNridgHt 0.9970785 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodNWAmes 0.0400001 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodOldTown 1.0000000 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodSawyer 0.0322628 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodSawyerW 0.9532231 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodSomerst 0.0335742 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
NeighborhoodStoneBr 0.9453852 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodSWISU 0.9841571 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
NeighborhoodTimber 0.3539203 0.0000 0.0000000 0.0000000 1.0000000 0.0000000
NeighborhoodVeenker 0.0356709 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
BF NA 1.0000 0.9862185 0.6505912 0.5564458 0.5186063
PostProbs NA 0.0383 0.0378000 0.0249000 0.0213000 0.0199000
R2 NA 0.8395 0.8406000 0.8404000 0.8415000 0.8415000
dim NA 15.0000 16.0000000 16.0000000 17.0000000 17.0000000
logmarg NA -1724.4386 -1724.4524966 -1724.8684932 -1725.0248048 -1725.0952296

Although the initial model included only 6 variables, the Neighborhood categorical variable was broken out into binary indicator variables during the Bayesian modeling (based on its 27 levels), expanding the model to 31 explanatory variables. As a result, the analysis considered a total of 2^31, or 2,147,483,648, possible models. The table above summarizes the top five. From the posterior probabilities, we see that the first two models each have a posterior probability of approximately 3.8%, so together they account for only about 7.6% of the total posterior probability across all the generated models. All of the top five models include log(area), Overall.Qual, Kitchen.Qual, and Heating.QC. Exter.Qual is not in the Highest Probability Model; it is included in 2 of the top five models and has a relatively low marginal inclusion probability of 0.35.

The inclusion probabilities can be better visualized by ranking the variables from greatest to lowest marginal inclusion probability:

coef_values = coef(simple_model_bma, estimator='BMA')
post_probs = coef_values$probne0 
post_names = coef_values$namesx
coef_df = data.frame(post_probs, post_names) 
coef_df <- coef_df[order(coef_df$post_probs, decreasing=TRUE),]
ggplot(data= coef_df, aes(x=reorder(post_names, -post_probs), y=post_probs)) + geom_col() + ggtitle("Inclusion Probabilities") +  xlab("Variable") + ylab("Marginal Inclusion Probability") +
    theme(axis.text.x = element_text(angle = 50, hjust = 1), plot.title = element_text(hjust = 0.4))

Neighborhoods with more extreme median prices have stronger inclusion probabilities. The four least expensive median-price neighborhoods (OldTown, MeadowV, BrDale, and IDOTRR) are the four neighborhood categories with the greatest marginal inclusion probabilities. The relative neighborhood rankings do not correspond exactly to their relative p-value importance in the initial unselected model. For example, GrnHill looks relatively less important in the BMA ranking: 10 neighborhoods have inclusion probabilities greater than 0.94, while the inclusion probability of GrnHill is less than 0.5. In the initial simple model, GrnHill’s p-value was among the top 7 neighborhoods in terms of significance.

With an inclusion probability of 0.35, Exter.Qual appears much less important than the other quality metrics (Overall.Qual, Kitchen.Qual, Heating.QC) as was also reflected in its barely significant p-value of the initial model and in the elimination of Exter.Qual in the BIC selected model.


2.3 Residual Plots of AIC Selected Model and Identification of Outliers

To assess the model performance, the residuals will be examined with diagnostic plots. For the model to be a good representation of the data, the residuals should be distributed around zero with constant scatter and should be normally distributed. I will check whether the residuals meet these requirements with a residual plot, a histogram, and a Q-Q plot. The AIC-selected model is used for the plots below.

From the plots below, we see that the residuals appear for the most part normally distributed, with random scatter around zero. There is some deviation from a straight line at the low and high ends of the Q-Q plot, indicating some departure from normality; the model may not perform as well for very high or very low values.

plot(simple_model_AIC, which=1, add.smooth = F)

hist(simple_model_AIC$residuals)

plot(simple_model_AIC, which=2, add.smooth = F)

There are several outliers in the data, with observation #428 being an extreme outlier. This observation will be examined for any indication of why the house had such a low actual sale price.

outlier <- ames_train[428,]
print(exp(predict(simple_model_BIC, outlier)))
##        1 
## 65701.64
kable(t(outlier)) %>%
  kable_styling("striped", full_width = F)
PID 902207130
area 832
price 12789
MS.SubClass 30
MS.Zoning RM
Lot.Frontage 68
Lot.Area 9656
Street Pave
Alley NA
Lot.Shape Reg
Land.Contour Lvl
Utilities AllPub
Lot.Config Inside
Land.Slope Gtl
Neighborhood OldTown
Condition.1 Norm
Condition.2 Norm
Bldg.Type 1Fam
House.Style 1Story
Overall.Qual 2
Overall.Cond 2
Year.Built 1923
Year.Remod.Add 1970
Roof.Style Gable
Roof.Matl CompShg
Exterior.1st AsbShng
Exterior.2nd AsbShng
Mas.Vnr.Type None
Mas.Vnr.Area 0
Exter.Qual 3
Exter.Cond 2
Foundation BrkTil
Bsmt.Qual Fa
Bsmt.Cond Fa
Bsmt.Exposure No
BsmtFin.Type.1 Unf
BsmtFin.SF.1 0
BsmtFin.Type.2 Unf
BsmtFin.SF.2 0
Bsmt.Unf.SF 678
Total.Bsmt.SF 678
Heating GasA
Heating.QC 3
Central.Air N
Electrical SBrkr
X1st.Flr.SF 832
X2nd.Flr.SF 0
Low.Qual.Fin.SF 0
Bsmt.Full.Bath 0
Bsmt.Half.Bath 0
Full.Bath 1
Half.Bath 0
Bedroom.AbvGr 2
Kitchen.AbvGr 1
Kitchen.Qual 3
TotRms.AbvGrd 5
Functional Typ
Fireplaces 1
Fireplace.Qu Gd
Garage.Type Detchd
Garage.Yr.Blt 1928
Garage.Finish Unf
Garage.Cars 2
Garage.Area 780
Garage.Qual Fa
Garage.Cond Fa
Paved.Drive N
Wood.Deck.SF 0
Open.Porch.SF 0
Enclosed.Porch 0
X3Ssn.Porch 0
Screen.Porch 0
Pool.Area 0
Pool.QC NA
Fence NA
Misc.Feature NA
Misc.Val 0
Mo.Sold 6
Yr.Sold 2010
Sale.Type WD
Sale.Condition Abnorml
LogPrice 9.456341
LogArea 6.723832
Neighborhood_ordered OldTown
Total.Baths 1
LogLot.Area 9.175335
LogX1st.Flr.SF 6.723832

Home #428 in the dataset has a very low price of only $12,789. The predicted price is $65,702 when converted from the log scale to dollars for the BIC-selected model (the AIC and BMA Highest Probability models predict similar values of $66,643 and $63,407). The actual sale price was thus far lower than predicted. It is a very old house with an overall quality score (Overall.Qual) of 2 on a scale of 10. The Overall.Cond score is also very low (2), and the basement is unfinished. The last remodel (Year.Remod.Add) dates back to 1970, and the Sale.Condition is Abnorml. The data indicate that the house is in very bad condition. Overall.Qual is included in the model, but many of the other variables indicating poor condition are not. Sale.Condition was also not included in the model, and it may be that non-normal sales are difficult to model alongside normal sales. This is investigated further in Section 2.5 (Analysis of Sale.Condition: modeling normal sales only improves the overall model).


2.4 Relative Model Performance w/ RMSE: AIC selected model performs best

Next, to evaluate the relative performance of the models, the Root Mean Squared Error (RMSE), which corresponds to the standard deviation of the residuals, will be calculated on both the train set used above and the test set.
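As a concrete illustration of the computation used below, a small helper (the name rmse_dollars is mine, not from the analysis code) back-transforms log-scale predictions before computing RMSE on the dollar scale:

```r
# Illustrative helper: RMSE in dollars for a model that predicts log(price)
rmse_dollars <- function(log_pred, price) {
  sqrt(mean((price - exp(log_pred))^2))
}

# Toy check with made-up prices: perfect log-scale predictions give (near) zero RMSE
price <- c(100000, 150000, 200000)
rmse_dollars(log(price), price)        # effectively 0
rmse_dollars(log(price) + 0.1, price)  # a ~10% multiplicative error, expressed in dollars
```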

load("ames_test.Rdata")
#Convert model factor variables in test set to numeric, as was done in the train set:
ames_test <- modify_levels(ames_test)
ames_test <- ames_test %>% filter(Neighborhood != "Landmrk")

train_rmses <- c(
  sqrt(mean((ames_train$price - exp(predict(simple_model_AIC, ames_train)))^2)),
  sqrt(mean((ames_train$price - exp(predict(simple_model_BIC, ames_train)))^2)),
  sqrt(mean((ames_train$price - exp(predict(simple_model_bma, newdata = ames_train, estimator = "HPM")$fit))^2))
)

test_rmses <- c(
  sqrt(mean((ames_test$price - exp(predict(simple_model_AIC, ames_test)))^2)),
  sqrt(mean((ames_test$price - exp(predict(simple_model_BIC, ames_test)))^2)),
  sqrt(mean((ames_test$price - exp(predict(simple_model_bma, newdata = ames_test, estimator = "HPM")$fit))^2))
)

kable(data.frame("Selection.Method" = c("AIC", "BIC", "BMA(HPM)"), "Train RMSE" = train_rmses, "Test RMSE" = test_rmses)) %>%
  kable_styling("striped", full_width = F)
Selection.Method Train.RMSE Test.RMSE
AIC 32571.49 27951.53
BIC 32792.36 28073.66
BMA(HPM) 33666.38 29820.26

RMSE values are reported in dollars.

Note: There was one observation in the test set which contained a new (unseen) neighborhood, Landmrk. In order to predict prices, this one observation was removed.

The RMSE values for the train set are around $33,000, which is reasonable given the median home price of $159,467. The AIC selection generated the lowest RMSE values ($32,571.49 for the train set and $27,951.53 for the test set), suggesting that BIC may have underfit the data; the inclusion of Exter.Qual in the AIC selection thus appears to have improved the predictive power of the model. For all of the model selection methods, the test set RMSE is lower than the train set RMSE, so there is no evidence of overfitting. A model generally fits the train set better because that is the “seen” data, so it is unexpected for the test set RMSE to be the lower of the two. There were some strong outliers in the train set, however, and it is possible that these observations elevated the train RMSE and that the test set contained observations that were easier to predict.

The Highest Probability Model (HPM) performed the worst in the analysis. Because Neighborhood was broken out into binary indicators and not all of them were included in the HPM, it contained the least variable information; this is consistent with the AIC model (which has the most variables) having the best score. The results suggest that more predictive power may be gained by including additional variables in the model, which will be pursued in Part 3. Other BMA prediction options, such as averaging over all models, could potentially yield better predictions, but their computation was very slow and not feasible for this project.

Next, however, I will look more closely at Sale.Condition, since it appeared to be an explanatory factor for the strong outlier in the data and may explain the surprisingly high RMSE in the train set; some houses with non-normal Sale.Condition values may simply be difficult to model.

2.5 Analysis of Sale.Condition: Modeling normal sales only improves overall model

ggplot(ames_train, aes(x=log(area), y=log(price), group=Sale.Condition)) + 
  labs(title="Training Set: log(price) vs log(area) grouped by Sale.Condition") +
  geom_point(aes(shape=Sale.Condition, color=Sale.Condition), size=1)

summary <- ames_train %>% group_by(Sale.Condition) %>% summarize(median.log.price=median(log(price)), median.log.area=median(log(area)))

kable(summary) %>% kable_styling("striped", full_width = F)
Sale.Condition median.log.price median.log.area
Abnorml 11.79056 7.181592
AdjLand 11.83713 7.389934
Alloca 11.93523 7.300527
Family 11.91170 7.305188
Normal 11.95440 7.232010
Partial 12.44566 7.437070

The log(area) and log(price) for homes grouped by Sale.Condition are displayed above. An abnormal sale is a sale that is not typical of the current market, and can result from family transactions, incorrectly priced homes, potential foreclosure, etc. The subcategory of family sales is listed separately; these homes do tend to sell for less than normal-sale values despite being larger (based on the median values of log(price) and log(area)). The same is true for AdjLand and Alloca sales. The Abnorml sales have the lowest median.log.price but also the lowest median.log.area, and the scatterplot shows outliers associated with Abnorml. Partial sales, in contrast, tend to be larger and sell for higher values (based on the medians in the table); their sale prices are greater than expected for their size based on the scatterplot.

Next, the predicted vs actual home prices in the train set will be plotted and the RMSE by Sale.Condition will be calculated to determine if there is indeed more error in the model’s price predictions for homes sold under non-normal Conditions:

ggplot(ames_train, aes(x=log(price), y=predict(simple_model_AIC, ames_train), group=Sale.Condition)) +
  geom_point(aes(shape=Sale.Condition, color=Sale.Condition), size=1) + 
  labs(title="Training Set: Predicted vs Actual log(price) grouped by Sale.Condition", 
       y="Predicted Log(price) AIC Model") +
  geom_abline(intercept=0, slope=1)

sale_types <- unique(as.vector(ames_train$Sale.Condition))
counts <- vector("numeric")
rmse_by_type <- vector("numeric")
for (i in sale_types){
  data_subset <- ames_train[ames_train$Sale.Condition==i,]
  counts[i] <- dim(data_subset)[1]
  rmse_by_type[i] <- sqrt(mean((data_subset$price - exp(predict(simple_model_AIC, data_subset)))**2))
}
kable(data.frame("counts" = counts, "Train RMSE.$" = rmse_by_type)) %>% kable_styling("striped", full_width = F)
counts Train.RMSE..
Normal 834 26339.023
Partial 82 66431.955
AdjLand 2 4393.754
Abnorml 61 33777.533
Alloca 4 33882.151
Family 17 52136.512

The diagonal line above is the identity line, corresponding to perfect predictions. The normal sales (in blue) comprise 83.4% of the data. A large fraction of the outliers, however, are family, abnormal, and partial sales. The partial sales (in pink) tend to cluster in the upper right and were predicted to sell for less than their actual selling prices, consistent with the observation in the previous graph that partial sales sell for more than expected given their log(area). The table shows the RMSE values by Sale.Condition. With the exception of adjusted land sales (of which there are only two), normal sales have much lower RMSE values than the other sale types, indicating more accurate predictions. Compared to the normal-sale RMSE, the abnormal-sale RMSE is over 28% higher, the family-sale RMSE is almost double, and the partial-sale RMSE is more than double.

Given the impact of Sale.Condition, home prices will next be modeled with the 6 variables of interest using only houses sold under normal conditions. Keeping only normal sales retains 834 of the 1000 sales in the train dataset. Below, the fitting and selection with AIC and BIC are repeated using normal sales only (output suppressed), and the final models with AdjR2 values are displayed.

train_normal = ames_train[ames_train["Sale.Condition"] == "Normal",]
test_normal = ames_test[ames_test["Sale.Condition"] == "Normal",]
simple_model_normal <- lm(log(price) ~ log(area) + Overall.Qual + Exter.Qual + Kitchen.Qual + Heating.QC + Neighborhood, data=train_normal)
simple_model_normal_AIC <- stepAIC(simple_model_normal, k=2)
sample_size <- dim(train_normal)[1]
simple_model_normal_BIC <- stepAIC(simple_model_normal, k=log(sample_size))
## Simple Model - Normal Sales Only - AIC Selected:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Exter.Qual + 
##     Kitchen.Qual + Heating.QC + Neighborhood, data = ames_train)
## The AdjR2 is 0.8416684
## Simple Model - Normal Sales Only - BIC Selected:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Kitchen.Qual + 
##     Heating.QC + Neighborhood, data = ames_train)
## The AdjR2 is 0.8411626

As before, AIC retains all variables and BIC removes the Exter.Qual variable. The coefficients differ slightly from their values in the initial models fit on all sales, but the selected variables are the same and the AdjR2 values are essentially unchanged.

The RMSE for the train and test data for normal sales will now be calculated and compared to the data using all sales:

train_rmses_n <- c(
  sqrt(mean((train_normal$price - exp(predict(simple_model_normal_AIC, train_normal)))^2)),
  sqrt(mean((train_normal$price - exp(predict(simple_model_normal_BIC, train_normal)))^2))
)

test_rmses_n <- c(
  sqrt(mean((test_normal$price - exp(predict(simple_model_normal_AIC, test_normal)))^2)),
  sqrt(mean((test_normal$price - exp(predict(simple_model_normal_BIC, test_normal)))^2))
)

kable(data.frame("Selection.Method" = c("AIC", "BIC"), "Norm Sale Train RMSE" = train_rmses_n, "Norm Sale Test RMSE" = test_rmses_n, "All Train RMSE" = train_rmses[1:2], "All Test RMSE" = test_rmses[1:2])) %>%
  kable_styling("striped", full_width = F)
Selection.Method Norm.Sale.Train.RMSE Norm.Sale.Test.RMSE All.Train.RMSE All.Test.RMSE
AIC 26168.18 27724.07 32571.49 27951.53
BIC 26407.20 27839.21 32792.36 28073.66

All values shown are in dollars. Modeling only normal sales produced the expected pattern: the train RMSE is now lower than the test RMSE. The difference between the train and test RMSE for normal sales is small (approximately $1,500), suggesting overfitting is not an issue. The RMSEs are lower when using the normal-sale data only, indicating that modeling this subset of the data generates a stronger model.

An alternative to modeling only normal sales is to include the Sale.Condition category in the modeling process to account for the condition. From additional analysis (not shown), I concluded that while including Sale.Condition as a variable in the model for all sales did modestly improve its performance (both for the simple model here and for larger models as generated in Part 3), it consistently produced the unusual pattern of train RMSEs higher than test RMSEs, which does not happen when only normal sales are considered. Further analysis of the data sets revealed that only the train set contains observations with non-normal Sale.Condition values, indicating an unusual, non-random partition of the data. Consequently, I decided to model only normal sales in the larger model moving forward into Part 3.
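As a sketch of that alternative (not the suppressed analysis itself), Sale.Condition can simply be added as a predictor to the simple model, assuming the ames_train data frame loaded earlier in this report:

```r
# Hypothetical sketch: include Sale.Condition as an explanatory variable
# rather than filtering to normal sales (uses ames_train from above)
simple_model_sc <- lm(log(price) ~ log(area) + Overall.Qual + Exter.Qual +
                        Kitchen.Qual + Heating.QC + Neighborhood + Sale.Condition,
                      data = ames_train)
summary(simple_model_sc)$adj.r.squared   # compare against the 0.8417 baseline
```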


Part 3 - Final Model

Next, a larger model will be generated to see if it has better predictive power than the model generated in Part 2.

3.1 Feature Engineering and Linear Regression Model with 21 Explanatory Variables

Before selection, the larger model had 21 variables: the variables of the simple model plus others that appeared to be associated with differences in log(price) based on additional plotting and correlation calculations (not shown). Although few houses had pools, those with pools had higher home values, so a binary yes/no Pool variable was created from the multi-level Pool.QC variable based on whether the home has a pool. As discussed in the EDA, many variables had NA for some observations; these often simply indicated that the house lacks the given feature. Basement and garage metrics with NA were therefore converted to “No Feature” levels (for attributes such as quality) or to “0” (for counts such as cars or bathrooms) and used as categorical variables.

Function to modify variables as discussed above:

modify_variables <- function(df) {
  
  levels(df$Bsmt.Qual)<-c(levels(df$Bsmt.Qual),"NoBsmt")  
  df$Bsmt.Qual[is.na(df$Bsmt.Qual)] <- "NoBsmt"          
  df$Bsmt.Qual <- replace(df$Bsmt.Qual, which(df$Bsmt.Qual == ''), 'NoBsmt')  #one entry is blank

  df$Bsmt.Full.Bath[is.na(df$Bsmt.Full.Bath)] <- 0
  df$Bsmt.Full.Bath <- as.factor(df$Bsmt.Full.Bath)

  levels(df$Garage.Finish)<-c(levels(df$Garage.Finish),"NoGrg")  
  df$Garage.Finish[is.na(df$Garage.Finish)] <- "NoGrg"          
  df$Garage.Finish <- replace(df$Garage.Finish, which(df$Garage.Finish == ''), 'NoGrg')

  df$Garage.Cars[is.na(df$Garage.Cars)] <- 0
  df$Garage.Cars <- as.factor(df$Garage.Cars)

  levels(df$Garage.Qual)<-c(levels(df$Garage.Qual),"NoGrg")  
  df$Garage.Qual[is.na(df$Garage.Qual)] <- "NoGrg"
  df$Garage.Qual <- replace(df$Garage.Qual, which(df$Garage.Qual == ''), 'NoGrg')

  df$Pool <-  ifelse(is.na(df$Pool.QC), "No", "Yes")
  df
}
ames_train <- modify_variables(ames_train)

Full tested model before selection:

lm(log(price) ~ log(Lot.Area) + Overall.Qual + Exter.Qual + Kitchen.Qual + Heating.QC + Neighborhood + Fireplaces + Total.Baths + Year.Built + Central.Air + Year.Remod.Add + log(X1st.Flr.SF) + X2nd.Flr.SF + Garage.Finish + Garage.Qual + Garage.Cars + Bsmt.Qual + Bsmt.Full.Bath + Pool + Paved.Drive + Electrical

3.2 BIC and AIC selection methods produce different models

The AIC and BIC selection functions were run on normal sale data only (for the reasons described in Section 2.5 Analysis of Sale.Condition). The lengthy output from the selection functions is suppressed and summaries are provided:

train_normal <- ames_train[ames_train$Sale.Condition == "Normal", ]

full_model <- lm(log(price) ~ log(Lot.Area) + Overall.Qual + Exter.Qual + Kitchen.Qual + Heating.QC + Neighborhood + Fireplaces + Total.Baths + Year.Built + Central.Air + Year.Remod.Add + log(X1st.Flr.SF) + X2nd.Flr.SF + Garage.Finish +  Garage.Qual + Garage.Cars + Bsmt.Qual + Bsmt.Full.Bath + Pool + Paved.Drive + Electrical, data=train_normal)

full_model_AIC <- stepAIC(full_model, k=2)
sample_size = dim(train_normal)[1]
full_model_BIC <- stepAIC(full_model, k=log(sample_size))
## Full Model AIC Selected:
## lm(formula = log(price) ~ log(Lot.Area) + Overall.Qual + Exter.Qual + 
##     Kitchen.Qual + Heating.QC + Neighborhood + Fireplaces + Year.Built + 
##     Central.Air + Year.Remod.Add + log(X1st.Flr.SF) + X2nd.Flr.SF + 
##     Garage.Qual + Garage.Cars + Bsmt.Qual + Bsmt.Full.Bath + 
##     Paved.Drive, data = train_normal)
## The AdjR2 is 0.9263294
## Full Model BIC Selected:
## lm(formula = log(price) ~ log(Lot.Area) + Overall.Qual + Exter.Qual + 
##     Kitchen.Qual + Heating.QC + Fireplaces + Year.Built + Central.Air + 
##     Year.Remod.Add + log(X1st.Flr.SF) + X2nd.Flr.SF + Bsmt.Qual + 
##     Bsmt.Full.Bath + Paved.Drive, data = train_normal)
## The AdjR2 is 0.9138402

The Electrical, Total.Baths, Pool, and Garage.Finish variables were eliminated by both selection methods.

The expanded AIC-selected model has 17 variables and includes all 6 variables of the initial AIC-selected simple model. With the 11 additional variables, the AdjR2 increased from 0.8417 (for the simple model) to 0.93.

The final BIC model has 14 variables. Neighborhood is not included in the BIC model, although it was included in the simple BIC model created previously. Neighborhood was in fact the first variable dropped in the iterative BIC selection process; it was the most complex variable, with 26 degrees of freedom, which may explain its exclusion under BIC's heavier penalty. The BIC adjusted R2 increased from 0.8412 (for the simple BIC model with 5 variables) to 0.91 in the expanded model with 14 variables.
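The heavier BIC penalty can be seen directly: stepAIC charges k per model degree of freedom, with k = 2 for AIC and k = log(n) for BIC. For a categorical variable like Neighborhood with 26 degrees of freedom the difference is large; a quick sketch (the exact normal-sale count is not shown above, so an illustrative n = 800 is assumed):

```r
# Penalty a variable with `df` coefficients adds to the information criterion
ic_penalty <- function(k, df) k * df

n  <- 800   # illustrative size of the normal-sale train set (assumption)
df <- 26    # degrees of freedom quoted above for Neighborhood

aic_pen <- ic_penalty(2,      df)   # AIC:  2 * 26 = 52
bic_pen <- ic_penalty(log(n), df)   # BIC:  log(800) * 26, roughly 174

# Under BIC, Neighborhood must improve the fit about 3x more than
# under AIC before it "pays for itself"
bic_pen / aic_pen                   # log(800) / 2, roughly 3.3
```

With any n in the hundreds, log(n) is well above 2, so a many-level factor is far more expensive under BIC than under AIC.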

The AIC model contains all of the variables in the BIC model, with the addition of Neighborhood, Garage.Qual, and Garage.Cars. The BIC model does not contain any garage-related variables.


3.3 Residual Plots of AIC Selected Model

Residual plots will next be analyzed to determine whether the model assumptions are met. The more complex (AIC-selected) model will be evaluated.

plot(full_model_AIC, which=1, add.smooth = F)

hist(full_model_AIC$residuals)

plot(full_model_AIC, which=2, add.smooth = F)

From the plots above, the residuals look for the most part normally distributed, with random scatter around zero. The plots appear more normal than what was observed for the AIC-selected simple model. There is some deviation from a straight line at the low and high ends of the QQ plot, indicating some departure from normality and suggesting that the model may not perform as well for very high- or low-priced houses. The strong outlier from the simple model, observation 428, does not enter this model because it was a non-normal sale; in the simple model it had a standardized residual of -10 in the QQ plot. In the current model, the largest residual (observation 611) has a value of about -7.
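The extreme standardized residual cited above can be located programmatically with rstandard(); a small sketch:

```r
# Largest (in magnitude) standardized residual of a fitted lm,
# returned with its observation index as the name
worst_std_resid <- function(fit) {
  sr <- rstandard(fit)
  sr[which.max(abs(sr))]
}

# e.g. worst_std_resid(full_model_AIC)  # observation 611 in this analysis
```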

The model performance will next be evaluated with RMSE calculations.


3.4 Relative Model Performance w/ RMSE: AIC selected model again performs best

The RMSE on train and test data sets will be calculated to evaluate model performance and identify potential overfitting. As the new models have more variables, they are more prone to overfitting.
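Since the same dollar-scale RMSE expression is computed for each model/data-set pair, it can be wrapped in a small helper (a sketch; it assumes the model predicts log(price) and the data frame has a price column, as in this analysis):

```r
# RMSE in dollars for a model fit on log(price):
# predictions are exponentiated back to the dollar scale before comparing
rmse_dollars <- function(model, df) {
  sqrt(mean((df$price - exp(predict(model, df)))^2))
}

# e.g. rmse_dollars(full_model_AIC, train_normal)
```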

ames_test <- modify_variables(ames_test) #add modified and new variables to test set

full_train_rmses <- c(sqrt(mean((train_normal$price - exp(predict(full_model_AIC, train_normal)))**2)), sqrt(mean((train_normal$price - exp(predict(full_model_BIC, train_normal)))**2)))

full_test_rmses <- c(sqrt(mean((ames_test$price - exp(predict(full_model_AIC, ames_test)))**2)), sqrt(mean((ames_test$price - exp(predict(full_model_BIC, ames_test)))**2)))

kable(data.frame("Selection.Method"=c("AIC", "BIC"), "Full.Model.Train.RMSE"=full_train_rmses, "Full.Model.Test.RMSE"=full_test_rmses, "Simple.Model.Train.RMSE"=train_rmses_n, "Simple.Model.Test.RMSE"=test_rmses_n)) %>% kable_styling("striped", full_width = F)
Selection.Method   Full.Model.Train.RMSE   Full.Model.Test.RMSE   Simple.Model.Train.RMSE   Simple.Model.Test.RMSE
AIC                             18606.88               20663.24                  26168.18                 27724.07
BIC                             20415.97               21734.53                  26407.20                 27839.21

All the RMSEs above are in dollars and are from models using normal sales only for both fitting and predicting.

The RMSE for the final AIC model on the train set is $18,606.88, which is less than the simple model's train-set RMSE of $26,168.18. An improvement in train-set RMSE alone could reflect either overfitting or genuinely improved performance. There is also a strong decrease in the test RMSE ($20,663.24 vs $27,724.07), indicating improved predictive performance.

Overall, the expanded models outperform the simple models with RMSEs reduced by ~$6,000 or more, corresponding to over 20% RMSE reductions.

Based on RMSE values, the AIC model outperforms the BIC model. There is potentially more overfitting with the AIC model, however, as its test-set RMSE is over $2,000 greater than its train-set RMSE; for the BIC model, the gap is about $1,320.
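The train/test gaps quoted above follow directly from the table:

```r
# Overfitting gap = test RMSE - train RMSE, taken from the table above
aic_gap <- 20663.24 - 18606.88   # about $2,056 for the AIC model
bic_gap <- 21734.53 - 20415.97   # about $1,319 for the BIC model
```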



3.5 Based on coverage probability, the AIC model correctly reflects uncertainty

A final evaluation of the best model (AIC) will be done with a third dataset, the validation set which has 763 observations.

The RMSE metric is sensitive to outliers because errors are squared. Another metric to consider for model evaluation is coverage, which assesses whether the model properly reflects uncertainty by determining what fraction of the 95% prediction intervals contain the actual price.

load("ames_validation.Rdata")
#modifying variables, as was done for the train and test datasets
ames_validation <- modify_levels(ames_validation)
ames_validation <- modify_variables(ames_validation)

predict_final_val <- exp(predict(full_model_AIC, ames_validation))

predict_final_val_int <- exp(predict(full_model_AIC, ames_validation, interval="prediction", level=0.95))
interval_dataframe = as.data.frame(predict_final_val_int)
interval_dataframe['price'] = ames_validation$price
pred_val_coverage <- interval_dataframe %>% summarize(cover = sum(price >= lwr & price <= upr)/n())

cat("The RMSE of the AIC Model on the Validation set is $", 
    sqrt(mean((ames_validation$price - predict_final_val)**2)), "and ", 
    pred_val_coverage[1,1], " of the 95% prediction intervals contain the true price.")
## The RMSE of the AIC Model on the Validation set is $ 19405.04 and  0.9475754  of the 95% prediction intervals contain the true price.

The validation RMSE for the final AIC model is a little lower than the test-set RMSE ($19,405.04 vs $20,663.24), indicating strong performance. The coverage probability of 0.948 is almost exactly the expected 0.950 for a 95% prediction interval. The coverage probability for the test set with the AIC model is 0.946 (calculation not shown). The results indicate that the model may be very slightly permissive but for the most part accurately reflects uncertainty and represents the data well.
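Whether 0.948 is meaningfully different from the nominal 0.950 can be gauged against the binomial standard error of an empirical coverage proportion over 763 intervals (a quick sketch):

```r
# Standard error of observed coverage when the true coverage is 0.95
n_val <- 763
se    <- sqrt(0.95 * 0.05 / n_val)   # roughly 0.008

# The observed 0.9476 is well within one standard error of 0.95
abs(0.9476 - 0.95) < se              # TRUE
```

So the shortfall from nominal coverage is well within sampling noise for a validation set of this size.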


3.6 Conclusions

The final AIC-selected model had an AdjR2 of 0.926 and performed well predicting house prices on unseen data. There was little overfitting, as the model performed only slightly worse (based on RMSE) on out-of-sample than on within-sample data. The within-sample RMSE was $18,606.88, and the out-of-sample RMSEs were $20,663.24 for the test set and $19,405.04 for the validation set. The coverage probability was very close to expected (~0.947 vs 0.950), indicating the prediction-interval assumptions were met.

The EDA was effective in that most variables selected using the EDA were retained in the model. From the simple model, 6 of 6 and 5 of 6 of the variables were retained in the AIC and BIC models, respectively. For the expanded model, 17 of 21 and 14 of 21 were retained in the AIC and BIC models, respectively. The EDA was especially useful for understanding that “NA” did not necessarily correspond to missing data, that many variables performed better with log transformation, and that, in general, quality metrics were much more predictive of price than metrics relating to quantity.

The AIC model did appear to have slightly more overfitting than the BIC model (based on relative RMSE values for train and test sets), but it had the lowest RMSE for the test set, outperforming the BIC model. Since the AIC model with the most variables performed best, it is possible that an even more complex model that better predicts prices could be generated. This could be accomplished by including more variables or using a different selection method such as Adjusted R2.

One drawback to including Neighborhood is that it is a categorical variable with a large number of levels. When the test set was evaluated, one observation had to be removed because it contained a new (unseen) neighborhood. The use of categorical variables restricts the model to predicting only for factor levels that were observed in the train set.
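One way to guard against this at prediction time is to drop (or flag) rows whose factor levels never appeared in training; a sketch of a generic helper (the function name is illustrative, not from the analysis above):

```r
# Keep only rows of `newdata` whose level of column `col` was seen in `train`
drop_unseen_levels <- function(train, newdata, col) {
  seen <- unique(as.character(train[[col]]))
  newdata[as.character(newdata[[col]]) %in% seen, , drop = FALSE]
}

# e.g. ames_test_known <- drop_unseen_levels(train_normal, ames_test, "Neighborhood")
```

The same idea applies to any categorical predictor, such as Garage.Qual or Bsmt.Qual.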